Preventing False Discovery in Interactive Data Analysis is Hard

Authors: Moritz Hardt, Jonathan Ullman
Publication date: 06/08/2014

We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given $n$ samples from an unknown distribution can give valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution. Our result stands in stark contrast to the well known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, a recent work by Dwork et al. shows how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm; and how to answer a quadratic number of adaptive queries via a computationally efficient algorithm. The latter result implies that our result is tight up to a linear factor in $n$.

Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in Statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent.
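
To make the definitions in this abstract concrete, the following is a minimal Python sketch of a statistical query answered naively from the sample, plus the validity check. The names `empirical_answer`, `is_valid` and the `tolerance` parameter are assumptions made for illustration: the abstract only says a valid answer must be "close" to the true expectation. This naive estimator is not the mechanism of Dwork et al. and offers no protection against adaptively chosen queries.

```python
def empirical_answer(samples, predicate):
    """Answer a statistical query naively: the fraction of samples satisfying the predicate."""
    return sum(1 for x in samples if predicate(x)) / len(samples)


def is_valid(answer, true_expectation, tolerance):
    """An answer is valid if it lies within `tolerance` of the true expectation.

    `tolerance` is a hypothetical parameter standing in for "close".
    """
    return abs(answer - true_expectation) <= tolerance


# Toy setting: the distribution is uniform on {0, ..., 9}, so the predicate
# "x < 3" has true expectation 0.3.
samples = list(range(10)) * 10
answer = empirical_answer(samples, lambda x: x < 3)
print(answer, is_valid(answer, true_expectation=0.3, tolerance=0.05))
```

In an adaptive setting, each new predicate may depend on earlier answers, which is exactly where an estimator like this one can stop being valid.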

arXiv.org e-Print Archive

CiteSeerX

Crossref

Tight Lower Bounds for Differentially Private Selection

Authors: Thomas Steinke, Jonathan Ullman
Publication date: 10/04/2017

A pervasive task in the differential privacy literature is to select the $k$ items of "highest quality" out of a set of $d$ items, where the quality of each item depends on a sensitive dataset that must be protected. Variants of this task arise naturally in fundamental problems like feature selection and hypothesis testing, and also as subroutines for many sophisticated differentially private algorithms. The standard approaches to these tasks (repeated use of the exponential mechanism or the sparse vector technique) approximately solve this problem given a dataset of $n = O(\sqrt{k}\log d)$ samples. We provide a tight lower bound for some very simple variants of the private selection problem. Our lower bound shows that a sample of size $n = \Omega(\sqrt{k} \log d)$ is required even to achieve a very minimal accuracy guarantee. Our results are based on an extension of the fingerprinting method to sparse selection problems. Previously, the fingerprinting method has been used to provide tight lower bounds for answering an entire set of $d$ queries, but often only some much smaller set of $k$ queries are relevant. Our extension allows us to prove lower bounds that depend on both the number of relevant queries and the total number of queries.
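
As an illustration of the "repeated use of the exponential mechanism" mentioned in this abstract, here is a hedged Python sketch that selects $k$ items by running the exponential mechanism $k$ times and removing each winner. The function names, the `sensitivity` parameter and the even split of `epsilon` across rounds (basic composition) are assumptions for this sketch. The $O(\sqrt{k}\log d)$ sample bound quoted above relies on a tighter composition analysis that the code does not reproduce.

```python
import math
import random


def exponential_mechanism(scores, epsilon, sensitivity=1.0, rng=random):
    """Sample a key with probability proportional to exp(epsilon * score / (2 * sensitivity)).

    `scores` maps item index -> quality score. Subtracting the maximum score
    leaves the distribution unchanged and avoids overflow in exp().
    """
    top = max(scores.values())
    items = list(scores)
    weights = [math.exp(epsilon * (scores[i] - top) / (2.0 * sensitivity)) for i in items]
    return rng.choices(items, weights=weights, k=1)[0]


def private_top_k(scores, k, epsilon, sensitivity=1.0, rng=random):
    """Pick k distinct indices by repeated exponential mechanism ("peeling").

    Assumption: the budget is split evenly, epsilon / k per round.
    """
    remaining = dict(enumerate(scores))
    per_round = epsilon / k
    chosen = []
    for _ in range(k):
        pick = exponential_mechanism(remaining, per_round, sensitivity, rng)
        chosen.append(pick)
        del remaining[pick]  # never select the same item twice
    return chosen


quality = [3.0, 10.0, 7.0, 1.0, 9.0]
print(private_top_k(quality, k=2, epsilon=1.0, rng=random.Random(0)))
```

With a small `epsilon` the selection is noisy and may miss the best items; as `epsilon` grows, the output concentrates on the true top $k$.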

arXiv.org e-Print Archive

Crossref

Differential Privacy for the Analyst via Private Equilibrium Computation

Authors: Justin Hsu, Aaron Roth, Jonathan Ullman
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2013

We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.
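
For readers unfamiliar with equilibrium computation in two-player zero-sum games, the sketch below approximates a minimax strategy with multiplicative weights against a best-responding opponent. It is a generic, non-private textbook procedure and not the mechanism from this paper. The function name, the loss-matrix encoding (row player's losses in [0, 1]) and the step size are assumptions chosen for illustration.

```python
import math


def approx_minimax_row_strategy(loss, rounds=2000):
    """Approximate the row player's minimax strategy in a zero-sum matrix game.

    loss[i][j] is the row player's loss in [0, 1] when row i meets column j.
    The row player runs multiplicative weights; the column player best-responds.
    Returns the row player's average mixed strategy over all rounds.
    """
    n_rows, n_cols = len(loss), len(loss[0])
    eta = math.sqrt(math.log(n_rows) / rounds)  # assumed step size
    p = [1.0 / n_rows] * n_rows
    average = [0.0] * n_rows
    for _ in range(rounds):
        average = [a + pi / rounds for a, pi in zip(average, p)]
        # Column player picks the column that maximizes the row player's expected loss.
        col = max(range(n_cols),
                  key=lambda j: sum(p[i] * loss[i][j] for i in range(n_rows)))
        # Down-weight rows that suffered under that column, then renormalize.
        w = [pi * math.exp(-eta * loss[i][col]) for i, pi in enumerate(p)]
        total = sum(w)
        p = [wi / total for wi in w]
    return average


# Matching pennies: the unique equilibrium mixes both rows equally.
print(approx_minimax_row_strategy([[1.0, 0.0], [0.0, 1.0]]))
```

The averaged strategy approaches the equilibrium as the number of rounds grows. A private variant would additionally need to randomize the players' updates, which this sketch does not attempt.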

arXiv.org e-Print Archive

Crossref